[Feature] Add default init container in workers to wait for GCS to be ready #973
Conversation
Force-pushed from 5222472 to 139bc5b
Looks good.
There's a slight concern that some users may require additional configuration (security policy, etc.) to be copied over to the init container.
I wonder if we should copy the entire container spec and replace the entry point -- that could have unforeseen consequences for some users, though.
( OSS is hard :) )
Examined the implementation of the btw, looks like |
Add default init container in workers to wait for GCS to be ready (ray-project#973)
Why are these changes needed?
Currently, the init container logic is wrong: it waits for the head service rather than for the GCS server. The head service becomes ready as soon as the image pull finishes, and after that the only retry logic is implemented internally by Ray.
See kuberay/ray-operator/config/samples/ray-cluster.complete.yaml, lines 124 to 129 at commit 71e260f, for the original init container.
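The embedded snippet does not render here. For context, a representative version of the old init container from that sample YAML (a sketch from memory, so the exact content of lines 124 to 129 may differ) looks like:

```yaml
initContainers:
  # Old behavior (approximate): wait only until the head service DNS name resolves,
  # which says nothing about whether the GCS server inside the head Pod is ready.
  - name: init-myservice
    image: busybox:1.28
    command: ['sh', '-c', "until nslookup $RAY_IP.$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace).svc.cluster.local; do echo waiting for myservice; sleep 2; done"]
```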
For example, add `command: ["sleep 180"]` to the headGroupSpec; the head Pod command then becomes `sleep 180 && ulimit -n 65536; ray start ...`. To clarify, the GCS server requires a minimum of 120 seconds to become ready after the head service is ready. This exceeds the timeout of Ray's internal retry mechanism, so the worker will fail.
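For illustration only, a headGroupSpec fragment that reproduces the failure might look like the following; the container name and image tag are placeholders, not values taken from the sample:

```yaml
headGroupSpec:
  template:
    spec:
      containers:
        - name: ray-head                 # placeholder name
          image: rayproject/ray:2.3.0    # placeholder image tag
          # KubeRay prepends this to the generated start command, so the head
          # container effectively runs: sleep 180 && ulimit -n 65536; ray start ...
          command: ["sleep 180"]
```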
In this PR, we add a default init container that uses `ray health-check` to check the status of GCS, which prevents this issue. In addition, each init container must complete successfully before the next one starts, so it is fine to have two init containers for now to keep backward compatibility. We will remove the original init container from the sample YAML files after release 0.5.0.
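As a rough sketch of the idea (not the exact container the operator injects; the service name, port, and polling interval are assumptions), the new default init container behaves like:

```yaml
initContainers:
  # New behavior: block the worker's Ray container until the GCS server itself
  # responds to a health check, not merely until the head service resolves.
  - name: wait-gcs-ready
    image: rayproject/ray:2.3.0   # assumed to reuse the worker's Ray image
    command: ["/bin/bash", "-c"]
    args:
      - |
        until ray health-check --address raycluster-complete-head-svc:6379 > /dev/null 2>&1; do
          echo "Waiting for GCS to be ready."
          sleep 5
        done
```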
Related issue number

Closes #476
Checks